Abstract. Computing the LZ factorization (or LZ77 parsing) of a string is a computational bottleneck
in many diverse applications, including data compression, text indexing, and pattern discovery. We
describe new linear time LZ factorization algorithms, some of which require only 2n log n + O(log n)
bits of working space to factorize a string of length n. These are the most space efficient linear time
algorithms to date, using n log n bits less space than any previous linear time algorithm. The algorithms
are also simple to implement and very fast in practice.

1 Introduction 

In the 35 years since its discovery the LZ77 factorization of a string (named after its authors
Abraham Lempel and Jacob Ziv, and the year 1977 in which it was published) has been applied
all over computer science. The first uses of LZ77 were in data compression, and to this day it lies at
the heart of efficient and widely used file compressors, like gzip and 7zip. LZ77 is also important
as a measure of compressibility. For example, its size is a lower bound on the size of the smallest
context-free grammar that represents a string [1].

In all these applications (and most of the many others we have not listed) computation of
the factorization is a time- and space-bottleneck in practice. Our particular motivation is the
construction of compressed full-text indexes [13], several recent and powerful instances of which are
based on LZ77 [5,6,12].

Related work. There exists a variety of worst-case linear time algorithms to compute the LZ
factorization [2,3,4,7,14]. All of them require at least 3n log n bits of working space¹ in the worst case.
The most space efficient linear time algorithm is due to Chen et al. [2]. By overwriting the suffix
array it achieves a working space of (2n + s) log n bits, where s is the maximal size of the stack used
in the algorithm. However, in the worst case s = Θ(n). Another space efficient solution requiring
(2n + √n) log n bits of space in the worst case is from [I] but it computes only the lengths of LZ77
factorization phrases. It can be extended to compute the full parsing at the cost of an extra n log n
bits.

All of these algorithms rely on the suffix array, which can be constructed in O(n) time and using
(1 + ε) n log n bits of space (in addition to the input string but including the output of size n log n
bits) [9]. This raises the question of whether the space complexity of linear time LZ77 factorization
can be reduced from 3n log n bits. In this paper, we answer the question in the affirmative by
describing a linear time algorithm using 2n log n bits.

In terms of practical performance, the fastest linear time LZ factorization algorithms are the
very recent ones by Goto and Bannai [7], all using at least 3n log n bits of working space. Other



¹ The working space excludes the input string, the output factorization, and O(log n) terms.



candidates for the fastest algorithms are described by Kempa and Puglisi [10]. Due to nearly
simultaneous publication, no comparison between them exists so far. Experiments in this paper put
the algorithms of Kempa and Puglisi slightly ahead. Their algorithms are also very space efficient;
one of them uses 2n log n + n bits of working space and others even less. However, their worst-case
time complexity is O(n log σ) for an alphabet of size σ. More details about these algorithms are
given in Section 2.

Our contribution. We describe two linear time algorithms for LZ factorization. The first algorithm
uses 3n log n bits of working space and can be seen as a reorganization of an algorithm by Goto
and Bannai [7]. However, this reorganization makes it smaller and faster. In our experiments, this
is the fastest of all algorithms when the input is not highly repetitive.

The second algorithm employs a novel combinatorial technique to reduce the working space to
2n log n bits, which is at least n log n bits less than any previous linear time algorithm uses in the
worst case. The space reduction does not come at a great cost in performance. The algorithm is the
fastest on some inputs and not far behind the fastest on others.

Both algorithms share several nice features. They are alphabet-independent, using only char- 
acter comparisons to access the input. They make just one sequential pass over the suffix array, 
enabling streaming from disk, which would reduce the working space by a further n log n bits. They 
are also very simple and easy to implement. 

2 Preliminaries 

Strings. Throughout we consider a string X = X[1..n] = X[1]X[2]...X[n] of |X| = n symbols drawn
from an ordered alphabet of size σ.

For i = 1, ..., n we write X[i..n] to denote the suffix of X of length n - i + 1, that is X[i..n] =
X[i]X[i + 1]...X[n]. We will often refer to suffix X[i..n] simply as "suffix i". Similarly, we write X[1..i]
to denote the prefix of X of length i. We write X[i..j] to represent the substring X[i]X[i + 1]...X[j]
of X that starts at position i and ends at position j. Let lcp(i, j) denote the length of the
longest-common-prefix of suffix i and suffix j. For example, in the string X = zzzzzipzip, lcp(2, 5) = 1 = |z|,
and lcp(5, 8) = 3 = |zip|. For technical reasons we define lcp(i, 0) = lcp(0, i) = 0 for all i.
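The definition of lcp can be checked on the example string with a short Python sketch. It is illustrative only: the function name and the choice to keep the paper's 1-based positions on top of Python's 0-based strings are assumptions made here, not part of the original text.

```python
def lcp(X, i, j):
    """Length of the longest common prefix of suffix i and suffix j (1-based).

    Position 0 is the sentinel used in the text: lcp(i, 0) = lcp(0, i) = 0.
    """
    if i == 0 or j == 0:
        return 0
    n = len(X)
    k = 0
    # Plain character-by-character comparison until a mismatch or the end of X.
    while i + k <= n and j + k <= n and X[i + k - 1] == X[j + k - 1]:
        k += 1
    return k


X = "zzzzzipzip"
print(lcp(X, 2, 5), lcp(X, 5, 8))  # 1 3
```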

Suffix Arrays. The suffix array SA is an array SA[1..n] containing a permutation of the integers
1..n such that X[SA[1]..n] < X[SA[2]..n] < ... < X[SA[n]..n]. In other words, SA[j] = i iff X[i..n]
is the jth suffix of X in ascending lexicographical order. The inverse suffix array ISA is the inverse
permutation of SA, that is ISA[i] = j iff SA[j] = i. Conceptually, ISA[i] tells us the position of suffix
i in SA.
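For small inputs SA and ISA can be built directly from the definition. The sketch below is a naive construction by sorting suffixes, not one of the linear time constructions cited in this paper; the function names and the padding entry at index 0 (used to keep 1-based indexing) are assumptions made for illustration.

```python
def suffix_array(X):
    """Naive suffix array by sorting suffixes; index 0 is unused padding."""
    n = len(X)
    return [0] + sorted(range(1, n + 1), key=lambda i: X[i - 1:])


def inverse_suffix_array(SA):
    """ISA[i] = j iff SA[j] = i (both 1-based, index 0 unused)."""
    ISA = [0] * len(SA)
    for j in range(1, len(SA)):
        ISA[SA[j]] = j
    return ISA


X = "zzzzzipzip"
SA = suffix_array(X)
ISA = inverse_suffix_array(SA)
print(SA[1:])   # [9, 6, 10, 7, 8, 5, 4, 3, 2, 1]
print(ISA[1:])  # [10, 9, 8, 7, 6, 2, 4, 5, 1, 3]
```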

The array Φ[0..n] (see [8]) is defined by Φ[i] = SA[ISA[i] - 1], that is, the suffix Φ[i] is the
immediate lexicographical predecessor of the suffix i. For completeness and for technical reasons
we define Φ[SA[1]] = 0 and Φ[0] = SA[n] so that Φ forms a permutation with one cycle.
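As an illustration, Φ can be filled in with one pass over SA. The sketch below hard-codes the suffix array of the example string zzzzzipzip (with a padding entry at index 0) and assumes n ≥ 1; the function name is hypothetical.

```python
def phi_array(SA):
    """Phi[i] = SA[ISA[i] - 1] for a 1-based SA with padding at index 0."""
    n = len(SA) - 1
    Phi = [0] * (n + 1)
    for j in range(2, n + 1):
        Phi[SA[j]] = SA[j - 1]  # lexicographical predecessor of suffix SA[j]
    Phi[SA[1]] = 0              # the smallest suffix has no predecessor
    Phi[0] = SA[n]              # closes the single cycle
    return Phi


SA = [0, 9, 6, 10, 7, 8, 5, 4, 3, 2, 1]  # suffix array of "zzzzzipzip"
Phi = phi_array(SA)
print(Phi)  # [1, 2, 3, 4, 5, 8, 9, 10, 7, 0, 6]
```

Following Φ from position 0 visits the suffixes in descending lexicographical order and returns to 0 after n + 1 steps.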

LZ77. The LZ77 factorization uses the notion of a longest previous factor (LPF). The LPF at
position i in X is a pair (p_i, ℓ_i) such that p_i < i, X[p_i..p_i + ℓ_i - 1] = X[i..i + ℓ_i - 1], and ℓ_i > 0 is
maximized. In other words, X[i..i + ℓ_i - 1] is the longest prefix of X[i..n] which also occurs at some
position p_i < i in X. If X[i] is the leftmost occurrence of a symbol in X then such a pair does not
exist. In this case we define p_i = X[i] and ℓ_i = 0. Note that there may be more than one potential
p_i, and we do not care which one is used.

The LZ77 factorization (or LZ77 parsing) of a string X is then just a greedy, left-to-right parsing
of X into longest previous factors. More precisely, if the jth LZ factor (or phrase) in the parsing is
to start at position i, then we output (p_i, ℓ_i) (to represent the jth phrase), and then the (j + 1)th
phrase starts at position i + ℓ_i, unless ℓ_i = 0, in which case the next phrase starts at position i + 1.
We call a factor (p_i, ℓ_i) normal if it satisfies ℓ_i > 0 and special otherwise. The number of phrases in
the factorization is denoted by z.

For the example string X = zzzzzipzip, the LZ77 factorization produces:

(z,0), (1,4), (i,0), (p,0), (5,3).

The second and fifth factors are normal, and the other three are special.
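The example can be reproduced with a brute-force reference implementation that follows the definition literally. This quadratic-time sketch is only a checking aid and is not one of the algorithms of this paper; the function name is hypothetical, and ties between candidate positions are broken in favour of the leftmost one, which the definition permits.

```python
def lz77_naive(X):
    """Greedy LZ77 factorization by exhaustive search; positions are 1-based."""
    n = len(X)
    factors = []
    i = 1
    while i <= n:
        best_p, best_len = 0, 0
        for p in range(1, i):  # every earlier starting position
            k = 0
            while i + k <= n and X[p + k - 1] == X[i + k - 1]:
                k += 1         # the match may overlap position i
            if k > best_len:
                best_p, best_len = p, k
        if best_len == 0:
            factors.append((X[i - 1], 0))  # special factor: first occurrence
            i += 1
        else:
            factors.append((best_p, best_len))
            i += best_len
    return factors


print(lz77_naive("zzzzzipzip"))
# [('z', 0), (1, 4), ('i', 0), ('p', 0), (5, 3)]
```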

NSV/PSV. The LPF pairs can be computed using next and previous smaller values (NSV/PSV)
defined as

NSV_lex[i] = min{j ∈ [i + 1..n] | SA[j] < SA[i]}
PSV_lex[i] = max{j ∈ [1..i - 1] | SA[j] < SA[i]}.

If the set on the right hand side is empty, we set the value to 0. Further define

NSV_text[i] = SA[NSV_lex[ISA[i]]] (1)
PSV_text[i] = SA[PSV_lex[ISA[i]]]. (2)

If NSV_lex[ISA[i]] = 0 (PSV_lex[ISA[i]] = 0) we set NSV_text[i] = 0 (PSV_text[i] = 0).
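To make the definitions concrete, the following sketch evaluates them literally (in quadratic time) on the suffix array of the example string zzzzzipzip. It is a checking aid under the assumption of a 1-based SA with a padding entry at index 0; the function name is hypothetical, and the efficient stack-based computation is the subject of the next sections.

```python
def nsv_psv_text(SA):
    """NSV_text and PSV_text evaluated directly from Eqs. (1) and (2)."""
    n = len(SA) - 1
    NSV = [0] * (n + 1)
    PSV = [0] * (n + 1)
    for j in range(1, n + 1):
        # NSV_lex[j] and PSV_lex[j]; 0 when the set in the definition is empty.
        nxt = next((k for k in range(j + 1, n + 1) if SA[k] < SA[j]), 0)
        prv = next((k for k in range(j - 1, 0, -1) if SA[k] < SA[j]), 0)
        # SA[j] = i means ISA[i] = j, so this writes the entries for suffix i.
        NSV[SA[j]] = SA[nxt] if nxt else 0
        PSV[SA[j]] = SA[prv] if prv else 0
    return NSV, PSV


SA = [0, 9, 6, 10, 7, 8, 5, 4, 3, 2, 1]  # suffix array of "zzzzzipzip"
NSV, PSV = nsv_psv_text(SA)
print(NSV[1:])  # [0, 1, 2, 3, 4, 5, 5, 5, 6, 7]
print(PSV[1:])  # [0, 0, 0, 0, 0, 0, 6, 7, 0, 6]
```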

If (p_i, ℓ_i) is a normal factor, then either p_i = NSV_text[i] or p_i = PSV_text[i] is always a valid choice
for p_i [3]. To choose between the two (and to compute the ℓ_i component), we have to compute
lcp(i, NSV_text[i]) and lcp(i, PSV_text[i]) and choose the larger of the two, see Fig. 1.

Algorithm LZ-Factor(i, psv, nsv)
1: if lcp(i, psv) > lcp(i, nsv) then
2:     (p, ℓ) ← (psv, lcp(i, psv))
3: else
4:     (p, ℓ) ← (nsv, lcp(i, nsv))
5: if ℓ = 0 then p ← X[i]
6: output factor (p, ℓ)
7: return i + max(ℓ, 1)

Fig. 1. The basic procedure for computing a phrase starting at a position i given psv = PSV_text[i] and nsv = NSV_text[i].
The return value is the starting position of the next phrase.
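The procedure of Fig. 1 translates almost line by line into Python. In this sketch the names lcp and lz_factor and the output list passed as an argument are assumptions made for illustration; positions are 1-based as in the text.

```python
def lcp(X, i, j):
    """Longest common prefix of suffixes i and j (1-based); 0 is the sentinel."""
    if i == 0 or j == 0:
        return 0
    k, n = 0, len(X)
    while i + k <= n and j + k <= n and X[i + k - 1] == X[j + k - 1]:
        k += 1
    return k


def lz_factor(X, i, psv, nsv, out):
    """Append the phrase starting at i to out; return the next phrase start."""
    lp, ln = lcp(X, i, psv), lcp(X, i, nsv)
    p, length = (psv, lp) if lp > ln else (nsv, ln)
    if length == 0:
        p = X[i - 1]  # special factor: the symbol itself
    out.append((p, length))
    return i + max(length, 1)


out = []
X = "zzzzzipzip"
print(lz_factor(X, 8, 7, 5, out), out)  # 11 [(5, 3)]
```

The call uses PSV_text[8] = 7 and NSV_text[8] = 5 for the example string, which yields the fifth factor (5,3) of the example factorization.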



Lazy LZ Factorization. The fastest LZ factorization algorithms in practice are from recent papers
by Kempa and Puglisi [10] and Goto and Bannai [7]. A common feature between them is a lazy
evaluation of LCP values: lcp(i, NSV_text[i]) and lcp(i, PSV_text[i]) are computed only when i is a
starting position of a phrase. The values are computed by a plain character-by-character comparison
of the suffixes, but it is easy to see that the total time complexity is O(n). This is in contrast to
most previous algorithms that compute the LCP values for every suffix using more complicated
techniques. The new algorithms in this paper use lazy evaluation too.
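The linear bound can be spelled out as follows. Writing ℓ_j for the length component of the jth phrase (a notation introduced here only for this argument), each of the two lcp computations for that phrase performs at most ℓ_j matching comparisons plus one mismatch, the ℓ_j sum to at most n, and z ≤ n, so the total number of character comparisons is bounded by

```latex
\sum_{j=1}^{z} 2(\ell_j + 1) \;\le\; 2(n + z) \;\le\; 4n .
```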

Goto and Bannai [7] describe algorithms that compute and store the full set of NSV/PSV
values. One of their algorithms, BGT, computes the NSV_text and PSV_text arrays with the help of
the Φ array. The LZ factorization is then easily computed by repeatedly calling LZ-Factor. Two
other algorithms, BGS and BGL, compute the NSV_lex and PSV_lex arrays and use them together
with SA and ISA to simulate NSV_text and PSV_text as in Eqs. (1) and (2). All three algorithms run in
linear time and they use 3n log n (BGT), 4n log n (BGL) and (4n + s) log n (BGS) bits of working
space, where s is the size of the stack used by BGS. In the worst case s = O(n). The algorithms for
computing the NSV/PSV values are not new but come from [14] (BGT) and from [3] (BGL and
BGS). However, the use of lazy LCP evaluation makes the algorithms of Goto and Bannai faster
in practice than earlier algorithms.

Kempa and Puglisi [10] extend the lazy evaluation to the NSV/PSV values too. Using ISA and
a small data structure that allows arbitrary NSV/PSV queries over SA to be answered quickly,
they compute NSV_text[i] and PSV_text[i] only when i is a starting position of a phrase. The approach
requires (2 + 1/b) n log n bits of working space and O(n + zb + z log(n/b)) time, where b is a parameter
controlling a space-time tradeoff in the NSV/PSV data structure. If we set b = log n, and given
z = O(n / log_σ n), then in the worst case the algorithm requires O(n log σ) time, and 2n log n + n bits
of space. Despite the superlinear time complexity, this algorithm (ISA9) is both faster and more
space efficient than earlier linear time algorithms. Kempa and Puglisi also show how to reduce the
space to (1 + ε) n log n + n + O(σ log n) bits by storing a succinct representation of ISA (algorithms
ISA6r and ISA6s). Because of the lazy evaluation, these algorithms are especially fast when the
resulting LZ factorization is small.

3 3n log n-Bit Algorithm 

Our first algorithm is closely related to the algorithms of Goto and Bannai [7], particularly BGT
and BGS. It first computes the PSV_text and NSV_text arrays and uses them for lazy LZ factorization
similarly to the BGT algorithm. However, the NSV/PSV values are computed using the technique
of the BGS algorithm, which comes originally from [3]. The algorithm is given in Figure 2.

Algorithm KKP3
 1: SA[0] ← 0                         // bottom of stack
 2: SA[n + 1] ← 0                     // empties the stack at end
 3: top ← 0                           // top of stack
 4: for i ← 1 to n + 1 do
 5:     while SA[top] > SA[i] do
 6:         NSV_text[SA[top]] ← SA[i]
 7:         PSV_text[SA[top]] ← SA[top - 1]
 8:         top ← top - 1             // pop from stack
 9:     top ← top + 1
10:     SA[top] ← SA[i]               // push to stack
11: i ← 1
12: while i ≤ n do
13:     i ← LZ-Factor(i, PSV_text[i], NSV_text[i])

Fig. 2. LZ factorization using 3n log n bits of working space (the arrays SA, NSV_text and PSV_text).
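A runnable Python sketch of Fig. 2 is given below for illustration. It is not the implementation used in the experiments: the function names are hypothetical, the suffix array is passed as a plain list of 1-based positions and copied into a padded array, NSV_text and PSV_text are kept in two separate lists rather than interleaved, and the suffix array in the usage example is built naively by sorting.

```python
def lcp(X, i, j):
    """Longest common prefix of suffixes i and j (1-based); 0 is the sentinel."""
    if i == 0 or j == 0:
        return 0
    k, n = 0, len(X)
    while i + k <= n and j + k <= n and X[i + k - 1] == X[j + k - 1]:
        k += 1
    return k


def lz_factor(X, i, psv, nsv, out):
    """Fig. 1: emit the phrase starting at i, return the next phrase start."""
    lp, ln = lcp(X, i, psv), lcp(X, i, nsv)
    p, length = (psv, lp) if lp > ln else (nsv, ln)
    if length == 0:
        p = X[i - 1]
    out.append((p, length))
    return i + max(length, 1)


def kkp3(X, sa):
    """Sketch of Fig. 2; sa lists the suffixes (1-based) in lexicographic order."""
    n = len(X)
    SA = [0] + list(sa) + [0]  # SA[0]: stack bottom, SA[n+1]: final flush
    NSV = [0] * (n + 1)
    PSV = [0] * (n + 1)
    top = 0
    for i in range(1, n + 2):
        while SA[top] > SA[i]:
            NSV[SA[top]] = SA[i]
            PSV[SA[top]] = SA[top - 1]
            top -= 1           # pop
        top += 1
        SA[top] = SA[i]        # push; top <= i, so unread entries stay intact
    factors = []
    i = 1
    while i <= n:
        i = lz_factor(X, i, PSV[i], NSV[i], factors)
    return factors


X = "zzzzzipzip"
sa = sorted(range(1, len(X) + 1), key=lambda i: X[i - 1:])  # naive SA
print(kkp3(X, sa))  # [('z', 0), (1, 4), ('i', 0), ('p', 0), (5, 3)]
```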

The advantages of our algorithm compared to those of Goto and Bannai are:

1. All of the algorithms of Goto and Bannai use an auxiliary array of size n, either ISA or Φ. We
need no such auxiliary array, which saves both space and time.

2. Both BGS and our algorithm need a stack whose maximum size is not known in advance and can
be Θ(n) in the worst case. BGS uses a dynamically growing separate stack while we overwrite
the suffix array with the stack. This is possible because our algorithm makes just one pass
over the suffix array (like BGT but unlike BGS) and the stack is never larger than the already
scanned part of SA.

3. Similar to the algorithms of Goto and Bannai, we store the arrays PSV_text and NSV_text
interleaved so that the values PSV_text[i] and NSV_text[i] are next to each other. We compute the PSV
value when popping from the stack instead of when pushing to the stack as BGS does. This
way PSV_text[i] and NSV_text[i] are computed and written at the same time, which can reduce the
number of cache misses.



4 2n log n-Bit Algorithm 

Our second algorithm reduces space by computing and storing only the NSV values at first. It
then computes the PSV values from the NSV values on the fly. As a side effect, the algorithm also
computes the Φ array!

For t ∈ [0..n], let X_t = {X[i..n] | i ≤ t} be the set of suffixes starting at or before position
t. Let Φ_t be Φ restricted to X_t, that is, for i ∈ [1..t], suffix Φ_t[i] is the immediate lexicographical
predecessor of suffix i among the suffixes in X_t. In particular, Φ_n = Φ. As with the full Φ, we make
Φ_t a complete unicyclic permutation by setting Φ_t[i_min] = 0 and Φ_t[0] = i_max, where i_min and i_max
are the lexicographically smallest and largest suffixes in X_t. We also set Φ_0[0] = 0. A useful way to
view Φ_t is as a circular linked list storing X_t in the descending lexicographical order with Φ_t[0] as
the head of the list.

Now consider computing Φ_t given Φ_{t-1}. We need to insert a new suffix t into the list, which
can be done using standard insertion into a singly-linked list provided we know the position. It is
easy to see that t should be inserted between NSV_text[t] and PSV_text[t]. Thus

             t              if i = NSV_text[t]
  Φ_t[i] =   PSV_text[t]    if i = t
             Φ_{t-1}[i]     otherwise

and furthermore

  PSV_text[t] = Φ_{t-1}[NSV_text[t]].

The pseudocode for the algorithm is given in Figure 3. The NSV values are computed essentially
the same way as in the first algorithm (lines 1-9) and stored in the array Φ. In the second phase,
the algorithm maintains the invariant that after t rounds of the loop on lines 12-18, Φ[0..t] = Φ_t
and Φ[t + 1..n] = NSV_text[t + 1..n].
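The two phases described above can be sketched in Python as follows. This is an illustrative rendering of the KKP2s pseudocode rather than the implementation used in the experiments; the function names are hypothetical, the suffix array is passed as a list of 1-based positions and copied into a padded array, and the final contents of the Φ array are returned alongside the factors so that the side effect can be inspected.

```python
def lcp(X, i, j):
    """Longest common prefix of suffixes i and j (1-based); 0 is the sentinel."""
    if i == 0 or j == 0:
        return 0
    k, n = 0, len(X)
    while i + k <= n and j + k <= n and X[i + k - 1] == X[j + k - 1]:
        k += 1
    return k


def lz_factor(X, i, psv, nsv, out):
    """Fig. 1: emit the phrase starting at i, return the next phrase start."""
    lp, ln = lcp(X, i, psv), lcp(X, i, nsv)
    p, length = (psv, lp) if lp > ln else (nsv, ln)
    if length == 0:
        p = X[i - 1]
    out.append((p, length))
    return i + max(length, 1)


def kkp2s(X, sa):
    """Sketch of KKP2s; sa lists the suffixes (1-based) in lexicographic order."""
    n = len(X)
    SA = [0] + list(sa) + [0]  # SA[0]: stack bottom, SA[n+1]: final flush
    Phi = [0] * (n + 1)
    top = 0
    for i in range(1, n + 2):  # phase 1: Phi[j] = NSV_text[j]
        while SA[top] > SA[i]:
            Phi[SA[top]] = SA[i]
            top -= 1
        top += 1
        SA[top] = SA[i]
    Phi[0] = 0
    factors, nxt = [], 1
    for t in range(1, n + 1):  # phase 2: insert suffix t into the list
        nsv = Phi[t]
        psv = Phi[nsv]         # PSV_text[t] = Phi_{t-1}[NSV_text[t]]
        if t == nxt:
            nxt = lz_factor(X, t, psv, nsv, factors)
        Phi[t] = psv
        Phi[nsv] = t
    return factors, Phi


X = "zzzzzipzip"
sa = sorted(range(1, len(X) + 1), key=lambda i: X[i - 1:])  # naive SA
factors, Phi = kkp2s(X, sa)
print(factors)  # [('z', 0), (1, 4), ('i', 0), ('p', 0), (5, 3)]
print(Phi)      # [1, 2, 3, 4, 5, 8, 9, 10, 7, 0, 6]
```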

5 Getting Rid of the Stack 

The above algorithms overwrite the suffix array with the stack, which can be undesirable. First, we 
might need the suffix array later for another purpose. Second, since the algorithms make just one 
Algorithm KKP2s
 1: SA[0] ← 0                         // bottom of stack
 2: SA[n + 1] ← 0                     // empties the stack at end
 3: top ← 0                           // top of stack
 4: for i ← 1 to n + 1 do
 5:     while SA[top] > SA[i] do
 6:         Φ[SA[top]] ← SA[i]        // Φ[SA[top]] = NSV_text[SA[top]]
 7:         top ← top - 1             // pop from stack
 8:     top ← top + 1
 9:     SA[top] ← SA[i]               // push to stack
10: Φ[0] ← 0
11: next ← 1
12: for t ← 1 to n do
13:     nsv ← Φ[t]
14:     psv ← Φ[nsv]
15:     if t = next then
16:         next ← LZ-Factor(t, psv, nsv)
17:     Φ[t] ← psv
18:     Φ[nsv] ← t

Fig. 3. LZ factorization using 2n log n bits of working space (the arrays SA and Φ).

sequential pass over the suffix array, we could stream the suffix array from disk to further reduce 
the memory usage. In this section, we describe variants of our algorithms that do not overwrite SA 
(and still make just one pass over it). 

The idea is to replace the stack with PSV_text pointers. If j is the suffix on the top of the stack,
then the next suffixes in the stack are PSV_text[j], PSV_text[PSV_text[j]], and so on. This can be easily
seen in how the PSV_text values are computed in KKP3 (line 7 in Fig. 2). Thus given PSV_text we do
not need an explicit stack at all. Both of our algorithms can be modified to exploit this:

- In KKP3, we need to compute the PSV_text values when pushing on the stack rather than when
popping. The body of the main loop (lines 5-10 in Fig. 2) now becomes:

    while top > SA[i] do
        NSV_text[top] ← SA[i]
        top ← PSV_text[top]
    PSV_text[SA[i]] ← top
    top ← SA[i]

- KKP2s needs to be modified to compute PSV_text values first instead of NSV_text values. The
PSV_text-first version is symmetric to the NSV_text-first algorithm. In particular, Φ_t is replaced by
the inverse permutation Φ_t^{-1}. The algorithm is shown in Fig. 4.
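The PSV_text-first, stackless variant can be sketched in Python as follows. The first loop is the pointer-chasing replacement for the stack described above, and the suffix array is only read. As before, this is an illustration under assumptions: the function names are hypothetical and the suffix array is passed as a plain list of 1-based positions.

```python
def lcp(X, i, j):
    """Longest common prefix of suffixes i and j (1-based); 0 is the sentinel."""
    if i == 0 or j == 0:
        return 0
    k, n = 0, len(X)
    while i + k <= n and j + k <= n and X[i + k - 1] == X[j + k - 1]:
        k += 1
    return k


def lz_factor(X, i, psv, nsv, out):
    """Fig. 1: emit the phrase starting at i, return the next phrase start."""
    lp, ln = lcp(X, i, psv), lcp(X, i, nsv)
    p, length = (psv, lp) if lp > ln else (nsv, ln)
    if length == 0:
        p = X[i - 1]
    out.append((p, length))
    return i + max(length, 1)


def kkp2n(X, sa):
    """Stackless sketch; sa (1-based positions in lexicographic order) is only read."""
    n = len(X)
    PhiInv = [0] * (n + 1)
    top = 0
    for s in sa:                 # one sequential pass over the suffix array
        while top > s:
            top = PhiInv[top]    # pop: follow the PSV_text pointer
        PhiInv[s] = top          # PhiInv[s] = PSV_text[s]
        top = s                  # push
    PhiInv[0] = 0
    factors, nxt = [], 1
    for t in range(1, n + 1):
        psv = PhiInv[t]
        nsv = PhiInv[psv]        # successor of psv among the suffixes before t
        if t == nxt:
            nxt = lz_factor(X, t, psv, nsv, factors)
        PhiInv[t] = nsv
        PhiInv[psv] = t
    return factors, PhiInv


X = "zzzzzipzip"
sa = sorted(range(1, len(X) + 1), key=lambda i: X[i - 1:])  # naive SA
factors, PhiInv = kkp2n(X, sa)
print(factors)  # [('z', 0), (1, 4), ('i', 0), ('p', 0), (5, 3)]
print(PhiInv)   # [9, 0, 1, 2, 3, 4, 10, 8, 5, 6, 7]
```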

The versions without an explicit stack are slightly slower because of the non-locality of the
pointer accesses. If we need to avoid overwriting SA, a faster alternative would be to use a separate
stack. However, the stack can grow as big as n (for example when X = a^{n-1}b) which increases the
worst case space requirement by n log n bits.

We can get the best of both alternatives by adding a fixed size stack buffer to the stackless 
version. The buffer holds the top part of the stack to speed up stack operations. When the buffer 
gets full, the bottom half of its contents is discarded, and when the buffer gets empty, it is filled 
Algorithm KKP2n
 1: top ← 0                           // top of stack
 2: for i ← 1 to n do
 3:     while top > SA[i] do
 4:         top ← Φ^{-1}[top]         // pop from stack
 5:     Φ^{-1}[SA[i]] ← top           // Φ^{-1}[SA[i]] = PSV_text[SA[i]]
 6:     top ← SA[i]                   // push to stack
 7: Φ^{-1}[0] ← 0
 8: next ← 1
 9: for t ← 1 to n do
10:     psv ← Φ^{-1}[t]
11:     nsv ← Φ^{-1}[psv]
12:     if t = next then
13:         next ← LZ-Factor(t, psv, nsv)
14:     Φ^{-1}[t] ← nsv
15:     Φ^{-1}[psv] ← t

Fig. 4. LZ factorization using 2n log n bits of working space (the arrays SA and Φ^{-1}) without an explicit stack. The
SA remains intact after the computation.

half way using the PSV pointers. This version is called KKP2b. The time complexity remains linear 
and is independent of the buffer size. 

6 Experimental Results 

We implemented the algorithms described in this paper and compared their performance in practice
to algorithms from [10] and [7]. Experiments measured the time to compute the LZ factorization
of the text. All algorithms take the text and the suffix array as an input, hence we omit the time
to compute SA. The data set used in experiments is described in detail in Table 1.

Experimental Setup. We performed experiments on a 2.4GHz Intel Core i5 CPU equipped with
3072KB L2 cache and 4GB of main memory. The machine had no other significant CPU tasks
running and only a single thread of execution was used. The OS was Linux (Ubuntu 10.04, 64-bit)
running kernel 2.6.32. All programs were compiled using g++ version 4.4.3 with -O3 -static -DNDEBUG
options. For each combination of algorithm and test file we report the median runtime from five
executions. The times were recorded with the standard C clock function. All data structures reside
in main memory during computation.

Discussion. In nearly all cases algorithms introduced in this paper outperform the algorithms
from [7] (which are, to our knowledge, the fastest linear time LZ factorization algorithms
to date) while using the same or less space. In particular the KKP2 algorithms are always faster
and simultaneously use at least n log n bits less space. A notably large difference is observed for
non-repetitive data, where KKP3 significantly dominates all prior solutions.

The new algorithms (e.g. KKP2b) also dominate in most cases the general purpose practical
algorithms from [10] (ISA9 and ISA6s), while offering stronger worst case time guarantees, but are
somewhat slower (and use about 50% more space in practice) than ISA6r for highly repetitive data.

The comparison of KKP2n to KKP2s reveals the expected slowdown (up to 16%) due to the 
non-local stack simulation. However, this effect is almost completely eliminated by buffering the 
Table 1. Files used in the experiments. The files are from the standard (S) Pizza&Chili
corpus (http://pizzachili.dcc.uchile.cl/texts.html) and from the repetitive (R) Pizza&Chili corpus
(http://pizzachili.dcc.uchile.cl/repcorpus.html). The repetitive corpus consists of files containing multiple
copies of similar data (R), artificially generated sequences (A), and files created from the standard corpus by
concatenating 100 copies of a 1MB prefix and mutating them randomly (PR). The value of n/z (average length of a phrase in
the LZ factorization) is included as a measure of repetitiveness.



top part of the stack (KKP2b). With a 256KB buffer we obtained runtimes almost identical to 
KKP2s (< 1% difference in all cases). We observed a similar effect when applying this optimization 
to the KKP3 algorithm but, for brevity, we only present the improvement for KKP2. 

Finally, we observe that KKP2b, despite being slower than KKP3 on non-repetitive data, has
runtimes that are very close to the best in each category, making it perhaps the most applicable general
purpose algorithm.



Testfile          KKP3  KKP2s  KKP2b  KKP2n
proteins       [row incomplete; surviving values: 92.7, 123.2]
english           75.7   80.6   80.6   84.6  [remaining cells incomplete; surviving values
               in source order: 83.9, 108.6, 153.9, 81.7, 92.7, 97.3, 175.2, 86.1, 97.5]
sources           50.5   54.7   54.8   56.1      -  115.0   59.3   69.3   77.8   99.8
coreutils         43.6   40.2   40.2   40.6   43.3   49.4   41.9   51.5   52.2   55.4
cere              63.2   53.3   53.2   57.7   51.8   56.3   53.0   65.5   66.1   84.1
kernel            45.7   41.6   41.5   42.2   39.2   45.7   42.8   52.9   53.0   56.2
einstein.en       56.9   43.6   43.5   47.6   31.1   37.1   45.2   60.0   58.6   52.8
proteins.001.1    52.6   43.1   43.1   50.0   40.7   45.3   46.6   58.4   57.6   59.6
english.001.2     52.0   43.4   43.8   52.2   40.4   45.1   45.3   57.7   56.0   79.4
dna.001.1         55.6   43.9   43.9   50.8   39.2   43.7   45.0   58.5   57.8   62.5
sources.001.2     43.1   40.5   40.5   47.8   38.0   42.8   41.1   54.2   52.4   72.2
tm29              38.2   35.1   35.1   38.7   34.2   39.6   36.4   44.1   44.2   44.4
rs.13             77.8   49.0   49.4   52.0   34.8   40.8   51.8   59.5   59.5   56.5


Table 2. Times for computing LZ factorization. The times are seconds per gigabyte and do not include any reading 
from or writing to disk. 

7 Future Work



For data of low to medium repetitiveness the algorithms introduced in this paper are the fastest 
available. These algorithms should adapt easily to a semi-external setting because, apart from the 
need to permute the NSV/PSV values into text order, which can be handled in external memory, 
all non-sequential memory accesses are restricted to the input string. We are currently exploring 
this direction. 

There are several interesting open problems. One is the need for a fully external memory algo- 
rithm for LZ factorization, especially given the recent pattern matching indexes which use LZ77. 
Relatedly, parallel and distributed approaches are also of high interest. A recent step in the external 
memory direction is [11] . 

Another problem is to find a scalable way to accurately estimate the size of the LZ factorization 
in lieu of actually computing it. Such a tool would be useful for entropy estimation, and to guide 
the selection of appropriate compressors and compressed indexes when managing massive data sets. 

Finally, one wonders if only (1 + ε) n log n + O(log n) bits of working memory is enough for linear
runtime. The most space-efficient algorithm in this paper uses 2n log n + O(log n) bits, and in [10]
working space of (1 + ε) n log n + n + O(σ log n) bits (for arbitrary constant ε) is achieved, but at
the price of O(n log σ) runtime.

Acknowledgments

Our thanks go to Keisuke Goto and Hideo Bannai for sending us a preliminary copy of their 
manuscript. 

References 

1. M. Charikar, E. Lehman, D. Liu, R. Panigrahy, M. Prabhakaran, A. Sahai, and a. shelat. The smallest grammar
problem. IEEE Transactions on Information Theory, 51(7):2554-2576, 2005.

2. G. Chen, S. J. Puglisi, and W. F. Smyth. Fast and practical algorithms for computing all the runs in a string.
In Proc. Symposium on Combinatorial Pattern Matching (CPM), pages 307-315, 2007.

3. M. Crochemore and L. Ilie. Computing longest previous factor in linear time and applications. Information
Processing Letters, 106(2):75-80, 2008.

4. M. Crochemore, L. Ilie, and W. F. Smyth. A simple algorithm for computing the Lempel Ziv factorization. In
Proc. Data Compression Conference (DCC), pages 482-488, 2008.

5. T. Gagie, P. Gawrychowski, J. Kärkkäinen, Y. Nekrich, and S. J. Puglisi. A faster grammar-based self-index.
In Proc. Conference on Language and Automata Theory and Applications (LATA), LNCS 7183, pages 240-251,
2012.

6. T. Gagie, P. Gawrychowski, and S. J. Puglisi. Faster approximate pattern matching in compressed repetitive
texts. In Proc. Symposium on Algorithms and Computation (ISAAC), pages 653-662, 2011.

7. K. Goto and H. Bannai. Simpler and faster Lempel Ziv factorization. http://arxiv.org/abs/1211.3642, 2012.

8. J. Kärkkäinen, G. Manzini, and S. J. Puglisi. Permuted longest-common-prefix array. In Proc. Symposium on
Combinatorial Pattern Matching (CPM), LNCS 5577, pages 181-192. Springer, 2009.

9. J. Kärkkäinen, P. Sanders, and S. Burkhardt. Linear work suffix array construction. Journal of the ACM,
53(6):918-936, 2006.

10. D. Kempa and S. J. Puglisi. Lempel-Ziv factorization: simple, fast, practical. In Proc. Algorithm Engineering
and Experiments (ALENEX), 2013. In press.

11. D. Kempa and S. J. Puglisi. Lightweight LZ77 factorization, 2013. Manuscript.

12. S. Kreft and G. Navarro. Self-indexing based on LZ77. In Proc. Symposium on Combinatorial Pattern Matching
(CPM), LNCS 6661, pages 41-54, 2011.

13. G. Navarro and V. Mäkinen. Compressed full-text indexes. ACM Computing Surveys, 39(1):article 2, 2007.

14. E. Ohlebusch and S. Gog. Lempel-Ziv factorization revisited. In Proc. Symposium on Combinatorial Pattern
Matching (CPM), LNCS 6661, pages 15-26, 2011.






